
    Reconfigurable acceleration of Recurrent Neural Networks

    Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences, such as natural language processing, speech recognition and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature of and data dependencies in RNN computations can lead to system stalls, resulting in low throughput and high latency. This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, taking into account the data dependencies and high computational cost of RNNs. The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy which supports weight matrices with large dimensions while enabling both element-based and vector-based parallelism. The presented reconfigurable RNN designs show significant speedup over CPU, GPU, and other FPGA designs. The second contribution of this thesis is a weight reuse approach for large RNN models with weights stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. A performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption and performance. Promising power-efficiency improvements are achieved in addition to speedups over CPU and GPU designs. The third contribution of this thesis is a low-latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances the initiation intervals of multi-layer RNN inference to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection. To facilitate the use of this approach, we open-source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.
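The column-wise scheme in the first contribution can be illustrated with a minimal NumPy sketch (not taken from the thesis; function names are illustrative). Row-wise matrix-vector multiplication needs the complete input vector before any output can be finished, while the column-wise form consumes each input element as soon as it arrives, which is what lets hardware hide the recurrent latency:

```python
import numpy as np

def mvm_row_wise(W, x):
    # Conventional row-wise MVM: each output element is a dot product that
    # needs the full input vector, so a recurrent dependency on x stalls
    # the pipeline until x is completely available.
    return np.array([np.dot(W[i, :], x) for i in range(W.shape[0])])

def mvm_column_wise(W, x):
    # Column-wise MVM: accumulate one scaled column per input element.
    # Each x[j] can be consumed as soon as it is produced, allowing the
    # hardware to overlap the recurrent computation with the accumulation.
    y = np.zeros(W.shape[0])
    for j in range(W.shape[1]):
        y += W[:, j] * x[j]
    return y
```

Both variants compute the same product y = Wx; only the order of operations (and hence the data-dependency pattern) differs.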

    Optimizing Bayesian Recurrent Neural Networks on an FPGA-based Accelerator

    Neural networks have demonstrated outstanding performance in a wide range of tasks. Specifically, recurrent architectures based on long short-term memory (LSTM) cells have shown an excellent capability to model time dependencies in real-world data. However, standard recurrent architectures cannot estimate their uncertainty, which is essential for safety-critical applications such as medicine. In contrast, Bayesian recurrent neural networks (RNNs) are able to provide uncertainty estimation with improved accuracy. Nonetheless, Bayesian RNNs are computationally and memory demanding, which limits their practicality despite their advantages. To address this issue, we propose an FPGA-based hardware design to accelerate Bayesian LSTM-based RNNs. To further improve the overall algorithmic-hardware performance, a co-design framework is proposed to explore the most fitting algorithmic-hardware configurations for Bayesian RNNs. We conduct extensive experiments on healthcare applications to demonstrate the improvement of our design and the effectiveness of our framework. Compared with a GPU implementation, our FPGA-based design can achieve up to 10 times speedup with nearly 106 times higher energy efficiency. To the best of our knowledge, this is the first work targeting the acceleration of Bayesian RNNs on FPGAs.
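The Monte Carlo sampling that makes Bayesian RNNs expensive can be sketched in a few lines of plain Python (a toy single-neuron illustration, not the paper's design; all names are hypothetical). The prediction is the mean over many stochastic forward passes, and the spread of those passes is the uncertainty estimate:

```python
import random
import statistics

def mc_dropout_forward(x, weights, p_drop=0.5, rng=None):
    # One stochastic forward pass: each weight is kept with probability
    # (1 - p_drop) and rescaled, so the expected output matches the
    # deterministic network. Each pass approximates one posterior sample.
    rng = rng or random.Random()
    kept = [w / (1.0 - p_drop) if rng.random() >= p_drop else 0.0
            for w in weights]
    return sum(w * xi for w, xi in zip(kept, x))

def predict_with_uncertainty(x, weights, n_samples=100, seed=0):
    # Bayesian prediction: repeat the sampled forward pass and summarize.
    # The mean is the prediction; the standard deviation estimates uncertainty.
    # This repetition is exactly the workload the FPGA design accelerates.
    rng = random.Random(seed)
    samples = [mc_dropout_forward(x, weights, rng=rng)
               for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)
```

Because every sample reuses the same weights, the repeated passes share memory traffic, which is one reason dedicated hardware can amortize the cost so effectively.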

    Accelerating Bayesian Neural Networks via Algorithmic and Hardware Optimizations

    Bayesian neural networks (BayesNNs) have demonstrated their advantages in various safety-critical applications, such as autonomous driving or healthcare, due to their ability to capture and represent model uncertainty. However, standard BayesNNs must be run repeatedly for Monte Carlo sampling to quantify their uncertainty, which burdens their real-world hardware performance. To address this performance issue, this article systematically exploits the extensive structured sparsity and redundant computation in BayesNNs. Different from the unstructured or structured sparsity in standard convolutional NNs, the structured sparsity of BayesNNs is introduced by Monte Carlo Dropout and the sampling it requires during uncertainty estimation and prediction, and it can be exploited through both algorithmic and hardware optimizations. We first classify the observed sparsity patterns into three categories: channel sparsity, layer sparsity and sample sparsity. On the algorithmic side, a framework is proposed to automatically explore these three sparsity categories without sacrificing algorithmic performance. We demonstrate that structured sparsity can be exploited to accelerate CPU designs by up to 49 times, and GPU designs by up to 40 times. On the hardware side, a novel hardware architecture is proposed to accelerate BayesNNs, which achieves high hardware performance using runtime-adaptable hardware engines and intelligent skipping support. Implementing the proposed hardware design on an FPGA, our experiments demonstrate that the algorithm-optimized BayesNNs can achieve up to 56 times speedup compared with unoptimized Bayesian networks. Compared with the optimized GPU implementation, our FPGA design achieves up to 7.6 times speedup and up to 39.3 times higher energy efficiency.
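The "intelligent skipping" idea behind channel sparsity can be shown with a toy sketch (illustrative only; the function names are hypothetical and not from the article). When Monte Carlo Dropout zeroes whole channels, a naive layer still multiplies by the zeros, whereas a sparsity-aware layer enumerates only the surviving channels and skips the dropped work entirely:

```python
def dense_dropout_layer(x, mask, weights):
    # Naive evaluation: multiply every channel, including those zeroed
    # by the Monte Carlo Dropout mask -- wasted multiply-accumulates.
    return [sum(w[j] * x[j] * mask[j] for j in range(len(x)))
            for w in weights]

def sparse_dropout_layer(x, mask, weights):
    # Structured-sparsity view: list the surviving channels once and
    # iterate only over them, skipping dropped channels entirely.
    active = [j for j, m in enumerate(mask) if m]
    return [sum(w[j] * x[j] for j in active) for w in weights]
```

Both layers return identical outputs; the sparse form simply does less work, and in hardware that translates into cycles (and energy) saved per sample.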

    When Monte-Carlo Dropout Meets Multi-Exit: Optimizing Bayesian Neural Networks on FPGA

    Bayesian Neural Networks (BayesNNs) have demonstrated their capability of providing calibrated prediction for safety-critical applications such as medical imaging and autonomous driving. However, the high algorithmic complexity and the poor hardware performance of BayesNNs hinder their deployment in real-life applications. To bridge this gap, this paper proposes a novel multi-exit Monte-Carlo Dropout (MCD)-based BayesNN that achieves well-calibrated predictions with low algorithmic complexity. To further reduce the barrier to adopting BayesNNs, we propose a transformation framework that can generate FPGA-based accelerators for multi-exit MCD-based BayesNNs. Several novel optimization techniques are introduced to improve hardware performance. Our experiments demonstrate that our auto-generated accelerator achieves higher energy efficiency than CPU, GPU, and other state-of-the-art hardware implementations.
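The multi-exit idea can be sketched generically (a toy control-flow illustration, not the paper's architecture; all names are hypothetical). Each exit head inspects the intermediate activation and, if confident enough, returns early, so easy inputs skip the most expensive later stages:

```python
def multi_exit_predict(x, stages, exit_heads, threshold=0.9):
    # Run stages in order; after each stage the matching exit head
    # produces a (label, confidence) pair. Return as soon as the
    # confidence clears the threshold, skipping the remaining stages.
    # Assumes at least one stage/head pair is provided.
    h = x
    for stage, head in zip(stages, exit_heads):
        h = stage(h)
        label, confidence = head(h)
        if confidence >= threshold:
            return label, confidence
    # No exit fired: fall through with the final head's prediction.
    return label, confidence
```

Combined with Monte Carlo Dropout, each stochastic sample can take a different exit, which is what keeps the average cost of the repeated Bayesian passes low.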

    Algorithm and Hardware Co-design for Reconfigurable CNN Accelerator

    Recent advances in algorithm-hardware co-design for deep neural networks (DNNs) have demonstrated their potential in automatically designing neural architectures and hardware designs. Nevertheless, it is still a challenging optimization problem due to the expensive training cost and the time-consuming hardware implementation, which make the exploration of the vast design space of neural architectures and hardware designs intractable. In this paper, we demonstrate that our proposed approach is capable of locating designs on the Pareto frontier. This capability is enabled by a novel three-phase co-design framework with the following new features: (a) decoupling DNN training from the design space exploration of hardware architecture and neural architecture; (b) providing a hardware-friendly neural architecture space by considering hardware characteristics in constructing the search cells; and (c) adopting a Gaussian process to predict accuracy, latency and power consumption, avoiding time-consuming synthesis and place-and-route processes. In comparison with the manually designed ResNet101, InceptionV2 and MobileNetV2, we can achieve up to 5% higher accuracy with up to 3× speedup on the ImageNet dataset. Compared with other state-of-the-art co-design frameworks, our found network and hardware configuration can achieve 2%~6% higher accuracy, 2×~26× smaller latency and 8.5× higher energy efficiency.
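The surrogate-modelling step in (c) can be sketched with a minimal Gaussian-process regressor in NumPy (a generic GP posterior, not the paper's model; the kernel, its length scale, and the 1-D design encoding are all simplifying assumptions). The point is that predicting a metric from a design-point encoding is a cheap linear solve, versus hours of synthesis and place-and-route per candidate:

```python
import numpy as np

def rbf(a, b, length=1.0):
    # Squared-exponential kernel between two sets of 1-D design points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_predict(X_train, y_train, X_query, noise=1e-6):
    # Gaussian-process posterior mean and variance: a cheap surrogate
    # that predicts a metric (accuracy, latency, or power) at new design
    # points from a handful of measured ones.
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf(X_query, X_train)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    # Predictive variance: k(x*, x*) - k_s K^{-1} k_s^T (diagonal only).
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, var
```

The predictive variance also tells the search where the surrogate is unsure, which is useful for deciding which design points merit a real hardware evaluation.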

    FPGA-based Acceleration for Bayesian Convolutional Neural Networks

    Neural networks (NNs) have demonstrated their potential in a variety of domains, ranging from computer vision to natural language processing. Among various NNs, two-dimensional (2D) and three-dimensional (3D) convolutional neural networks (CNNs) have been widely adopted for a broad spectrum of applications such as image classification and video recognition, due to their excellent capabilities in extracting 2D and 3D features. However, standard 2D and 3D CNNs cannot capture their model uncertainty, which is crucial for many safety-critical applications including healthcare and autonomous driving. In contrast, Bayesian convolutional neural networks (BayesCNNs), as a variant of CNNs, have demonstrated their ability to express uncertainty in their predictions with a sound mathematical grounding. Nevertheless, BayesCNNs have not been widely used in industrial practice due to their compute requirements, which stem from sampling and subsequently running forward passes through the whole network multiple times. As a result, these requirements significantly increase the amount of computation and memory consumption in comparison to standard CNNs. This paper proposes a novel FPGA-based hardware architecture to accelerate both 2D and 3D BayesCNNs based on Monte Carlo Dropout. Compared with other state-of-the-art accelerators for BayesCNNs, the proposed design can achieve up to 4 times higher energy efficiency and 9 times better compute efficiency. An automatic framework capable of supporting partial Bayesian inference is proposed to explore the trade-off between algorithm and hardware performance. Extensive experiments are conducted to demonstrate that our framework can effectively find the optimal implementations in the design space.
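The partial-Bayesian-inference idea can be sketched abstractly (a toy illustration of the control flow, not the paper's framework; all names are hypothetical). Keep the front of the network deterministic and run it once, then repeat the Monte Carlo sampling only over the Bayesian tail:

```python
def partial_bayesian_predict(x, deterministic_layers, bayesian_tail_sample,
                             n_samples=10):
    # Partial Bayesian inference: the deterministic front is evaluated a
    # single time, and only the (cheaper) Bayesian tail is re-sampled.
    # This avoids repeating the full forward pass for every sample.
    h = x
    for layer in deterministic_layers:
        h = layer(h)
    samples = [bayesian_tail_sample(h) for _ in range(n_samples)]
    return sum(samples) / n_samples
```

The fraction of the network made Bayesian is exactly the algorithm/hardware trade-off knob such a framework would explore: more Bayesian layers give richer uncertainty at a higher repeated-computation cost.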

    Role of astrocytes in sleep deprivation: accomplices, resisters, or bystanders?

    Sleep plays an essential role in all animals with a nervous system that have been studied. Consequently, sleep deprivation leads to various pathological changes and neurobehavioral problems. Astrocytes are the most abundant cells in the brain and are involved in various important functions, including neurotransmitter and ion homeostasis, synaptic and neuronal modulation, and blood–brain barrier maintenance; furthermore, they are associated with numerous neurodegenerative diseases, pain, and mood disorders. Moreover, astrocytes are increasingly being recognized as vital contributors to the regulation of sleep-wake cycles, both locally and in specific neural circuits. In this review, we begin by describing the role of astrocytes in regulating sleep and circadian rhythms, focusing on: (i) neuronal activity; (ii) metabolism; (iii) the glymphatic system; (iv) neuroinflammation; and (v) astrocyte–microglia cross-talk. We then review the role of astrocytes in sleep deprivation comorbidities and sleep deprivation-related brain disorders. Finally, we discuss potential interventions targeting astrocytes to prevent or treat sleep deprivation-related brain disorders. Pursuing these questions would pave the way for a deeper understanding of the cellular and neural mechanisms underlying sleep deprivation-comorbid brain disorders.

    A multiple-antigen detection assay for tuberculosis diagnosis based on broadly reactive polyclonal antibodies

    Objective(s): Detection of circulating Mycobacterium tuberculosis (M. tuberculosis) antigens is promising for tuberculosis (TB) diagnosis. However, no single antigen marker has been found to be widely expressed in all TB patients. This study aimed to prepare broadly reactive polyclonal antibodies targeting multiple antigen markers (multi-target antibodies) and evaluate their efficacy in TB diagnosis. Materials and Methods: A fusion gene consisting of 38kD, ESAT6, and CFP10 was constructed and overexpressed. The fusion polyprotein was used as an immunogen to elicit production of multi-target antibodies, and their reactivities were tested. The multi-target antibodies and three corresponding antibodies elicited by each single antigen (mono-target antibodies) were then evaluated with sandwich ELISA for detecting M. tuberculosis antigens. Their diagnostic efficacies for TB were also compared. Results: As analyzed by Western blotting, the polyprotein successfully elicited production of multi-target antibodies targeting 38kD, ESAT6, and CFP10. When used as coating antibodies, the multi-target antibodies were more efficient in capturing the three antigens than the corresponding mono-target antibodies. When tested on clinical serum samples, the multi-target antibodies demonstrated significantly higher sensitivity for clinical TB diagnosis than all three mono-target antibodies. Conclusion: The multi-target antibodies enabled simultaneous detection of multiple antigens and significantly enhanced TB detection compared with routine mono-target antibodies. Our study may provide a promising strategy for TB diagnosis.

    Design Strategies for Aptamer-Based Biosensors

    Aptamers have been widely used as recognition elements for biosensor construction, especially in the detection of proteins or small-molecule targets, and are regarded as promising alternatives to antibodies in bioassays. In this review, we present an overview of reported design strategies for the fabrication of aptamer-based biosensors and classify them into four basic modes: target-induced structure-switching mode, sandwich or sandwich-like mode, target-induced dissociation/displacement mode, and competitive replacement mode. In view of the unprecedented advantages brought about by aptamers and smart design strategies, aptamer-based biosensors are expected to be among the most promising devices in bioassay-related applications.